# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
%matplotlib inline
# load in the dataset into a pandas dataframe
bikes = pd.read_csv('201902-fordgobike-tripdata.csv')
#viewing the first 5 rows of the dataset
bikes.head()
# checking for the last 5 ROWS
bikes.tail()
#structure of the dataset
bikes.shape
The dataset is made up of 183,412 rows and 16 columns.
#checking for columns with missing data
bikes.info()
There are 183,412 entries in the dataset, with missing data in several columns: start_station_id, start_station_name, end_station_id, end_station_name, member_birth_year, and member_gender.
#DataTypes present in the dataset
bikes.dtypes
The bike dataset's columns are spread across three dtypes: 7 float64, 7 object, and 2 int64.
# descriptive statistics for numeric variables
bikes.describe()
#checking for the total sum of missing data present in the dataset columns
missing_data = bikes.isnull().sum()
missing_data
There is missing data across six different columns in the dataset:
* 8,265 values missing in each of member_birth_year and member_gender.
* 197 values missing in each of start_station_id, start_station_name, end_station_id, and end_station_name.
bikes.duplicated().sum()
There are no duplicate rows in the dataset.
# percentage of missing data present
total_cell = np.prod(bikes.shape)  # np.product is deprecated; np.prod is the supported name
total_missing_data = missing_data.sum()
(total_missing_data/total_cell) * 100
The percentage of missing data is under 1%, so dropping the rows with missing values is a reasonable choice.
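The threshold reasoning above can be sketched on a small toy frame (synthetic data standing in for the bike file, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike data: 2 missing cells out of 12.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                    'b': ['x', 'y', None, 'z'],
                    'c': [10, 20, 30, 40]})

total_cells = np.prod(toy.shape)                       # 4 rows * 3 columns = 12
pct_missing = toy.isnull().sum().sum() / total_cells * 100
print(round(pct_missing, 1))                           # 16.7 here; under 1 in the real data
```

The same ratio computed on the real frame is what justifies the `dropna()` step later on.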
bikes1 = bikes.copy()
Changing the data types of certain columns:
* start_station_id from float to object.
* end_station_id from float to object.
* member_birth_year from float to object.
bikes1['start_station_id'] = bikes1['start_station_id'].astype(object)
bikes1['end_station_id'] = bikes1['end_station_id'].astype(object)
bikes1['member_birth_year'] = bikes1['member_birth_year'].astype(object)
print(bikes1['start_station_id'].dtypes)
print(bikes1['end_station_id'].dtypes)
print(bikes1['member_birth_year'].dtypes)
Convert the data type of these columns to ordered categorical dtypes:
* user_type from object to categorical.
* member_gender from object to categorical.
* bike_share_for_all_trip from object to categorical.
ordinal_var_dict = {'user_type': ['Customer', 'Subscriber'],
                    'member_gender': ['Male', 'Female', 'Other'],
                    'bike_share_for_all_trip': ['No', 'Yes']}
for col in ordinal_var_dict:
    # build an ordered categorical dtype from the category list, then cast the column
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_var_dict[col])
    bikes1[col] = bikes1[col].astype(ordered_var)
print(bikes1['user_type'].dtypes)
print(bikes1['member_gender'].dtypes)
print(bikes1['bike_share_for_all_trip'].dtypes)
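A quick aside on what the ordered categorical buys us: comparisons, `min`/`max`, and a fixed sort order all become available. A minimal sketch on a hypothetical yes/no column mirroring `bike_share_for_all_trip`:

```python
import pandas as pd

# hypothetical yes/no column, mirroring bike_share_for_all_trip
yn_type = pd.api.types.CategoricalDtype(categories = ['No', 'Yes'], ordered = True)
s = pd.Series(['Yes', 'No', 'Yes']).astype(yn_type)

print(s.dtype)             # category
print(s.min())             # 'No' -- ordering is defined, so min/max and sorting work
print((s == 'Yes').sum())  # 2
```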
bikes1.info()
Extracting the date from the start_time and end_time columns.
bikes1['start_date'] = pd.to_datetime(bikes1['start_time']).dt.date
bikes1['end_date'] = pd.to_datetime(bikes1['end_time']).dt.date
# conversion of start_date/end_date object to datetime dtype
bikes1['start_date'] = pd.to_datetime(bikes1['start_date'])
bikes1['end_date'] = pd.to_datetime(bikes1['end_date'])
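The two-step conversion above (extract `.dt.date`, then re-parse with `to_datetime`) can also be collapsed into one pass with `dt.normalize()`, which zeroes the time component while keeping the datetime dtype. A sketch on hypothetical timestamps:

```python
import pandas as pd

# hypothetical raw timestamps, shaped like the start_time column
ts = pd.Series(['2019-02-28 17:32:10', '2019-02-01 08:05:00'])
start_date = pd.to_datetime(ts).dt.normalize()   # truncate the time part, keep datetime dtype

print(start_date.dtype)     # datetime64[ns]
print(start_date.iloc[0])   # 2019-02-28 00:00:00
```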
bikes1.head()
bikes1.info()
Dropping every row with missing data from the dataset.
# remove all the rows that contain a missing value
bikes1 = bikes1.dropna()
bikes1
bikes1.shape
bikes1.info()
Convert member_birth_year to member_age
# using now = 2019, because that is when the bike trips in the dataset occurred
now = 2019
# create the ages
bikes1['member_age'] = bikes1['member_birth_year'].apply(lambda x: now - x)
# convert to integer
bikes1['member_age'] = bikes1['member_age'].astype('int')
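For reference, the same column can be computed without `apply` by plain vectorised subtraction, which is faster on a frame this size. A minimal sketch on hypothetical birth years (not the GoBike frame):

```python
import pandas as pd

now = 2019                                        # year of the trips, as above
birth_year = pd.Series([1984.0, 1990.0, 2001.0])  # float, like member_birth_year
member_age = (now - birth_year).astype('int')

print(member_age.tolist())   # [35, 29, 18]
```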
bikes1.head()
bikes1.info()
bikes1.describe()
#saving new datasets to csv
bikes1.to_csv('cleaned_bike_dataset.csv', index = False)
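One caveat worth knowing about this save step: `to_csv` does not preserve categorical or datetime dtypes, so they come back as `object` unless re-parsed on load. A small round-trip sketch on a hypothetical frame and temp file:

```python
import os
import tempfile
import pandas as pd

# hypothetical frame with a categorical and a datetime column
df = pd.DataFrame({'user_type': pd.Categorical(['Customer', 'Subscriber']),
                   'start_date': pd.to_datetime(['2019-02-01', '2019-02-02'])})

path = os.path.join(tempfile.mkdtemp(), 'roundtrip.csv')
df.to_csv(path, index = False)

# parse_dates restores the datetime column; the categorical comes back as object
back = pd.read_csv(path, parse_dates = ['start_date'])
print(back['user_type'].dtype)    # object -- no longer category
print(back['start_date'].dtype)   # datetime64[ns]
```

So anyone reloading `cleaned_bike_dataset.csv` should re-apply the categorical casts from the cleaning steps above.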
There are 183,412 rows in the bikes dataset, featuring 16 columns: duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, and bike_share_for_all_trip. The majority of the variables are numeric, except start_station_name and end_station_name, which give the addresses of the start and end stations, along with categorical variables:
* user_type: ['Customer', 'Subscriber']
* member_gender: ['Male', 'Female', 'Other']
* bike_share_for_all_trip: ['No', 'Yes']

After cleaning the data, new features were added: member_age, start_date, and end_date.
What is/are the main feature(s) of interest in your dataset?
My main features of interest are the start and end times of bike trips and trip duration, the locations of the most frequent start and end stations, and the major characteristics that can affect trip duration, such as age, gender, user_type, and bike_share_for_all_trip.
First, I look at trip counts by date and the locations (addresses) of the most frequent start stations, then at the age, gender, and user type for which bikes were shared the most.
# top ten start_date value_counts
date_top10 = bikes1.start_date.value_counts()[:10]
date_top10
#barplot showing Top 10 start Date with the most shared bike in SF
base_color = sb.color_palette()[0]
date_top10 = bikes1.start_date.value_counts()[:10].index
sb.countplot(data = bikes1, x = 'start_date', color = base_color, order = date_top10)
plt.xticks(rotation = 90)
plt.xlabel('Trip Start Date')
plt.ylabel('Count of Shared Bikes')
plt.title('Top 10 start Date with the most distributed bike in SF');
The bar plot above shows the value counts for the top 10 dates, all within February 2019. The 28th of February had the most shared bikes at 9,448 trips (perhaps because it was the last day of the month), while the 5th of February had the lowest count among the top 10 at 8,136 trips.
# top ten start stations that shared the most bikes
start_station_top10 = bikes1.start_station_name.value_counts()[:10]
start_station_top10
#bar plot showing ten most common start station for Bike distribution
start_station_top10 = bikes1.start_station_name.value_counts()[:10].index
base_color = sb.color_palette()[0]
sb.countplot(data = bikes1, y = 'start_station_name', color = base_color, order = start_station_top10)
plt.xticks(rotation = 90)
plt.xlabel('Count of shared Bikes')
plt.ylabel('Start Station Name')
plt.title('Top 10 Stations with the most shared Bikes in SF');
The bar plot above shows the top 10 start station names (locations) with the most bike trips in San Francisco, with "Market St at 10th St" having a total count of 3,649 and "Powell St BART Station (Market St at 5th St)" having 2,144.
#value count for member_gender
gender_counts = bikes1['member_gender'].value_counts()
#arranging the gender counts in descending order with .index
gender_order = gender_counts.index
#calculating the max_proportion in gender_counts
n_bikes = bikes1.shape[0]
max_gender_count = gender_counts.iloc[0]  # positional access; plain [0] on a labelled Series is deprecated
max_prop = max_gender_count / n_bikes
print(max_prop)
# Create an array of evenly spaced proportioned values
tick_props = np.arange(0,max_prop+0.1,0.1)
tick_props
#Create a list of String values that can be used as tick labels.
tick_names = ['{:0.2f}'.format(v) for v in tick_props]
tick_names
#Plot the bar chart, with new x-tick labels
sb.countplot(data=bikes1, y='member_gender', color=base_color, order=gender_order);
# Change the tick locations and labels
plt.xticks(tick_props * n_bikes, tick_names)
plt.xlabel('proportion');
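The manual proportion computation above can also be done in one step with `value_counts(normalize=True)`. A minimal sketch on hypothetical gender labels:

```python
import pandas as pd

# hypothetical gender labels
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Other'])
props = gender.value_counts(normalize = True)   # proportions, sorted descending

print(props.idxmax())   # 'Male'
print(props['Male'])    # 0.6
```

The largest proportion (used above for the tick range) is then simply `props.iloc[0]`.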
#Print the text (proportion) on the bars of a horizontal plot.
base_color = sb.color_palette()[0]
sb.countplot(data=bikes1, y='member_gender', color=base_color, order=gender_order);
# Logic to print the proportion text on the bars
for i in range(gender_counts.shape[0]):
    # gender_counts holds the frequency of unique values in `member_gender` in decreasing order
    count = gender_counts.iloc[i]
    # convert the count into a percentage, then into a string
    pct_string = '{:0.1f}%'.format(100 * count / n_bikes)
    plt.text(count + 1, i, pct_string, va = 'center')
plt.title('Proportion of Bikes Shared by Gender')
plt.xlabel('Count of Bikes Shared')
The horizontal bar plot above shows the proportion of bikes shared by gender: males account for far more than average at 74.6%, females for 23.3%, and other for 2.1%.
# proportion distribution of User_type using plotly
fig = px.pie(bikes1, values='duration_sec',names=bikes1['user_type'],
title='Proportions of Bikers Duration')
fig.show()
In this visual a new Python library, plotly, was used, giving the proportion of trip duration (sec) by user_type: Subscribers account for 82.4% and Customers for 17.6%.
def histogram_solution_1():
    plt.figure(figsize = (8, 6))
    bins = np.arange(0, bikes1['member_age'].max() + 2, 2)
    plt.hist(bikes1['member_age'], bins = bins)
    plt.xlabel('Age (Years)')
    plt.ylabel('Count')
    plt.title('Age distribution of Ford GoBike data in SF');
histogram_solution_1()
This distribution of age is long-tailed and right-skewed.
# there's a long tail in the distribution, so let's put it on a log scale instead
log_binsize = 0.025
bins = 10 ** np.arange(1.2, np.log10(bikes1['member_age'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 6])
plt.hist(data = bikes1, x = 'member_age', bins = bins)
plt.xscale('log')
plt.xticks([10,20,30,35,40,50,70,90,100], [10,20,30,35,40,50,70,90,100])
plt.xlabel('Age (Year)')
plt.ylabel('Count')
plt.title('Log-scale Distribution of Age in SF');
The histogram above shows the log-transformed distribution of age, confirming that the majority of riders are between 35 and 42 years old, largely younger adults, with fewer trips for older riders.
def histogram_solution_2():
    plt.figure(figsize = (10, 6))
    bins = np.arange(60, 10000, 50)
    plt.hist(bikes1['duration_sec'], bins = bins)
    plt.xlabel('Bike Duration (Sec)')
    plt.ylabel('Count')
    plt.title('Distribution of Bike Duration (sec) in SF');
histogram_solution_2()
This distribution of bike duration (sec) is long-tailed and right-skewed.
# there's a long tail in the distribution, so let's put it on a log scale instead
log_binsize = 0.025
bins = 10 ** np.arange(1.4, np.log10(bikes1['duration_sec'].max())+log_binsize, log_binsize)
plt.figure(figsize=[10, 6])
plt.hist(data = bikes1, x = 'duration_sec', bins = bins)
plt.xscale('log')
plt.xticks([50, 60, 70, 100, 250, 450,600,1000,8000],
[50, 60, 70, 100, 250, 450,600,1000,8000])
plt.xlabel('Bike Duration (Sec)')
plt.ylabel('Count')
plt.title('Log-scale for Duration_sec')
plt.show()
Using well-defined ticks on the log scale shows that the longer trip durations occurred mostly between 2.5k and 10k seconds.
The distribution of age was right-skewed and very crowded in terms of data points; a log transformation was applied to describe the data better, showing that most riders were between 25 and 42 years of age.
Yes, data wrangling was carried out first because the data was messy/untidy: missing values, a duplicate check, and, most importantly, data type conversions.
An unusual distribution was present in trip duration, which spread across a wide time scale and was difficult to read; using a log transformation with well-defined ticks showed the bulk of longer trips occurred between 2.5k and 10k seconds.
numeric_vars = ['member_age','duration_sec']
categoric_vars = ['user_type', 'member_gender', 'bike_share_for_all_trip']
# correlation plot using Heatmap
plt.figure(figsize = [8, 5])
sb.heatmap(bikes1[numeric_vars].corr(), annot = True, fmt = '.3f',
           cmap = 'vlag_r', center = 0)  # center the diverging colormap at zero correlation
plt.title('Correlation between member_age and duration')
plt.show()
The heatmap's color range shows there is not much correlation between member_age and duration_sec.
def scatterplot_solution_1():
    sb.regplot(data = bikes1, x = 'duration_sec', y = 'member_age');
    plt.xlabel('Duration (Sec)')
    plt.ylabel('Member Age (Years)')
    plt.title('Correlation between duration and member_age')
scatterplot_solution_1()
This scatter plot shows a slight positive relationship between duration_sec and member_age (years). Although the data points are heavily overplotted, the visual suggests that riders aged 25-54 years tend to take longer trips.
# plot matrix: sampled to avoid overplotting and give a clearer view of the numeric data
print("bikes1.shape=",bikes1.shape)
bikes1_samp = bikes1.sample(n=500, replace = False)
print("bikes1_samp.shape=",bikes1_samp.shape)
g = sb.PairGrid(data = bikes1_samp, vars = numeric_vars)
g = g.map_diag(plt.hist, bins = 20);
g.map_offdiag(plt.scatter);
# plot matrix of numeric features against categorical features.
bikes1_samp = bikes1.sample(n=2000, replace = False)
def boxgrid(x, y, **kwargs):
    """ Quick hack for creating box plots with seaborn's PairGrid. """
    default_color = sb.color_palette()[0]
    sb.boxplot(x = x, y = y, color = default_color)
plt.figure(figsize = [10, 10])
g = sb.PairGrid(data = bikes1_samp, y_vars = ['member_age','duration_sec'], x_vars = categoric_vars,
height = 3, aspect = 1.5)
g.map(boxgrid)
plt.show();
sb.countplot(data = bikes1, x = 'member_gender', hue = 'user_type')
plt.ylabel('counts of Shared bike')
plt.title('Relationship of Gender Associated with Riders');
The clustered bar plot depicts that the majority of bikes shared in San Francisco went to male riders, who were mostly Subscribers, with a very low Customer count.
sb.heatmap(bikes1.corr(numeric_only = True), annot = True);  # restrict to numeric columns
This heatmap shows a slight correlation of 0.075 between member_age and start_station_latitude/end_station_latitude, which may suggest that many riders were spread along the north-south axis of San Francisco.
The relationships observed in this part of the investigation were between pairs of variables:
First, I looked closely at the relationship between member_age and duration_sec using a heatmap, which showed a very low correlation of 0.006; a scatter plot of the same pair showed a slightly positive correlation, with most riders between 25 and 54 years of age.
Second, the clustered bar chart depicts that male bikers were the most common, the majority of them being Subscribers.
Yes, the 'Other' gender barely participated in trips, regardless of whether the rider was a Customer or Subscriber user_type.
# scatter plot showing a better description of relationship between duration & age by Gender
fig = px.scatter(bikes1, x="duration_sec", y="member_age", size="duration_sec" ,
color=bikes1['member_gender'],
title='Correlation between duration and member_age by Gender',log_y=True, size_max=20)
fig.show()
Plotly, a Python library, was used here because I needed to depict every data point between duration and member_age, with marker size mapped to duration_sec and a third variable, color, mapped to gender. The scatter plot shows that people who travelled trips above 25k seconds are between 25 and 42 years old (mostly male), while trips under 20k seconds span riders aged 50-140 years, though a few younger riders (18-30 years) also travelled less than 20k seconds.
sign_markers = [['Customer', 'o'],
                ['Subscriber', 's']]
for sign, marker in sign_markers:
    # plot each user_type subset with its own marker shape
    bikes_sign = bikes1[bikes1['user_type'] == sign]
    plt.scatter(data = bikes_sign, x = 'duration_sec', y = 'member_age', marker = marker)
plt.legend(['Customer','Subscriber'], title = 'user_type')
plt.xlabel('Duration(sec)')
plt.ylabel('Age(years)')
plt.title('Age and Duration by user_type');
The scatter plot above shows the relationship between member_age and trip duration in seconds, split by user_type. It shows heavy congestion of data points for both Subscribers and Customers within the 0-20k second range and the 28-45 year age range.
# Faceting in two direction to avoid overplotting
g = sb.FacetGrid(data = bikes1,col ='user_type')
g.map(plt.scatter,'duration_sec', 'member_age');
Splitting user_type into separate subplots using FacetGrid.
# scatter plot showing a better description of relationship between duration & age by user_type
fig = px.scatter(bikes1, x="duration_sec", y="member_age", size="duration_sec" ,
color=bikes1['user_type'],
title='Correlation between duration and member_age associated with Riders',log_y=True, size_max=20)
fig.show()
The plotly scatter plot above shows that the majority of riders who travelled the longest trips (in seconds) were of the Subscriber type, within the 28-57 year age range.
The first relationship observed here is the correlation between member_age and duration, split by gender and also by user_type. The features of interest were strengthened most in the member_age vs. duration relationship by user_type, which contained many data points.
Yes, the collision (overplotting) of data points between features was striking.
From the exploratory data analysis above, several findings and observations were made across one, two, or more variables in the bike trip dataset. First, I identified the start dates with the most shared bikes in San Francisco and the most frequent start stations. Histograms were then used to examine the distributions of age and duration (sec), and pie charts showed the proportions by user_type and by gender; most riders were male, and the majority of those males were Subscribers. Finally, bivariate and multivariate scatter plots showed relationships between features such as member_age and duration_sec, using user_type and member_gender as points of reference.
References:
* https://stackoverflow.com/questions/26788854/pandas-get-the-age-from-a-date-example-date-of-birth
* https://github.com/jemc36/Udacity-DAND-DataVisualization-Ford-GoBike/blob/master/exploration.ipynb
* https://python.tutorialink.com/how-to-extract-month-name-and-year-from-date-column-of-dataframe/